Abstract. Each time-series has its own linear trend, the directionality of the
time-series, and removing the linear trend is crucial to obtaining more intuitive
matching results. Supporting linear detrending in subsequence matching is a
challenging problem because of the huge number of possible subsequences. In this
paper we define this problem as linear detrending subsequence matching and propose
an efficient index-based solution. To this end, we first present the notion of
LD-windows (LD means linear detrending), which are obtained as follows: we
eliminate the linear trend from a subsequence rather than from each window itself,
and obtain LD-windows by dividing the detrended subsequence into windows. Using
LD-windows we then present a lower bounding theorem for the index-based matching
solution and formally prove its correctness. Based on the lower bounding theorem,
we next propose the index building and subsequence matching algorithms for linear
detrending subsequence matching. We finally show the superiority of our index-based
solution through extensive experiments.

Key words: data mining, time-series databases, similar sequence matching,
linear detrending, subsequence matching

1 Introduction 

Time-series data are of growing importance in data mining and data warehousing
[7,10]. A time-series is a sequence of real numbers representing values at
specific points in time. Typical examples include stock prices, music data,
network traffic data, moving object trajectories, and biomedical data [2,3,5,9,16].
The time-series data stored in a database are called data sequences, and those
given by users are called query sequences. Finding data sequences similar to
a given query sequence in the database is called similar sequence matching
or time-series matching. In many similar sequence matching models, two sequences
X = {X[1], ..., X[n]} and Y = {Y[1], ..., Y[n]} are said to be similar
if the distance D(X, Y) ≤ ε, where ε is the user-specified tolerance. In this paper
we use the Euclidean distance, $D(X, Y) = \sqrt{\sum_{i=1}^{n} |X[i] - Y[i]|^2}$,
as the distance function.

Linear trend, a representative distortion of time-series data [9,12], shows the
directionality of a time-series, and linear detrending in similar sequence matching
is crucial to obtaining more intuitive matching results. Figure 1 shows an example
of comparing two sequences before and after linear detrending: Figure 1(a)
represents the original sequences Q and S; Figure 1(b) the linear detrended
sequences Q' and S'. We obtain Q' and S' by linear detrending, i.e., by subtracting
the corresponding trend lines f(Q) and f(S) from Q and S, respectively. In Figure
1(a) there is a large distance between Q and S, and these two sequences will be
determined to be non-similar. In contrast, the distance between Q' and S' is
very small in Figure 1(b), and they will be determined to be similar. This means
that non-similar sequences can be identified as similar ones after linear
detrending, and vice versa. Linear detrending is thus useful for finding the
similarity of changes that is hidden by the linear trend of time-series data [8,9].
Motivated by this example, we attack the problem of linear detrending in similar
sequence matching, especially in subsequence matching [5,13].



Fig. 1. Comparison of two sequences S and Q before and after linear detrending:
(a) original time-series before linear detrending; (b) new time-series after
linear detrending.

In this paper we address the problem of linear detrending in subsequence
matching. Supporting linear detrending is simple in whole matching since
all data and query sequences have the same length. It is, however, a challenging
problem in subsequence matching because we need to consider a huge number of
possible data subsequences to be linearly detrended. We call this matching scheme
linear detrending subsequence matching. Formally speaking, for a query sequence Q
and a data sequence S, linear detrending subsequence matching finds
all subsequences S[i : j] such that $D(\hat{Q}, \hat{S}[i:j]) \le \epsilon$, where
$\hat{Q}$ and $\hat{S}[i:j]$ are the linear detrended (sub)sequences of Q and
S[i : j], respectively.

We propose an index-based solution for linear detrending subsequence matching.
To this end, we first present a novel notion of LD-windows (linear detrending
windows). Suppose a subsequence S[i : j] includes a window S[a : b] (i.e.,
i ≤ a < b ≤ j); then we obtain the LD-window of S[a : b] by eliminating the
linearity of the subsequence S[i : j] rather than that of the window S[a : b]
itself. This notion enables an LD-window to represent multiple subsequences of
different lengths, and as a result we can use only one index in subsequence
matching [15]. Using the LD-windows we next present a lower bounding theorem for
the index-based matching solution and formally prove its correctness. Based on
this lower bounding theorem, we then propose the index building and subsequence
matching algorithms. We finally show the superiority of our index-based
solution through extensive experiments. Experimental results show that, compared
with the sequential scan, our solution improves the matching performance
by one or two orders of magnitude.

2 Related Work 

Similar sequence matching can be classified into whole matching and subsequence
matching [13]. Whole matching [1,4] finds data sequences similar to
a query sequence, where the lengths of the data and query sequences are all
identical. Subsequence matching [5,7,11,12,14,15] finds subsequences, contained in
data sequences, that are similar to a query sequence of arbitrary length. Thus,
subsequence matching is a generalization of whole matching [5,13], and we focus
on subsequence matching in this paper.

Many efficient solutions have been proposed for subsequence matching [5,7,13].
These solutions consist of index-building and subsequence matching algorithms.
In the index-building algorithm, the solution constructs an R*-tree as follows:
it divides data sequences into multiple windows of size ω; transforms those
windows to f (≪ ω)-dimensional points using a lower-dimensional transformation
such as the discrete Fourier transform (DFT) or piecewise aggregate approximation
(PAA); and stores the points (or minimum bounding rectangles (MBRs) containing
multiple points) into the R*-tree. In the subsequence matching algorithm, the
solution finds similar subsequences as follows: it divides the query
sequence into multiple windows of the same size ω; transforms each window to
an f-dimensional point; makes a range query using the point and the tolerance;
constructs a candidate set by searching the R*-tree; and finally obtains the actual
similar subsequences by eliminating false alarms through the post-processing
step.

Representative distortions embedded in time-series are offset translation,
amplitude scaling, noise, and linear trend [6,9]. In similar sequence matching,
there have been many efforts to remove these distortions from time-series data.
For example, offset translation and amplitude scaling can be handled by the
normalization transform, and its subsequence matching solutions were proposed in
[11,12,15]. Also, the moving average transform can alleviate the noise of a
time-series, and its subsequence matching solution was proposed in [14]. To the
best of our knowledge, however, there is no solution to linear detrending
subsequence matching; in this paper we define the problem and present an
efficient index-based solution.

3 Linear Detrending Subsequence Matching 
3.1 Problem Definition 

For a time-series, its linear trend is the straight line that best reflects its
directionality. The least square method is most widely used to obtain the trend
line of a time-series [8]. For a sequence X = {X[1], ..., X[n]}, the linear
function given by the least square method is g(k) = αk + β, where α and β are
obtained by Eq. (1) [5].

$$\alpha = \frac{n\sum_{k=1}^{n} k\,X[k] - \sum_{k=1}^{n} k \,\sum_{k=1}^{n} X[k]}{n\sum_{k=1}^{n} k^{2} - \left(\sum_{k=1}^{n} k\right)^{2}}, \qquad \beta = \frac{1}{n}\sum_{k=1}^{n} X[k] - \alpha\,\frac{1}{n}\sum_{k=1}^{n} k \qquad (1)$$

Linear detrending is the process of obtaining a new time-series from an original 
time-series by removing the corresponding linear trend. The following is the 
formal definition of linear detrending. 

Definition 1. For a sequence X = {X[1], ..., X[n]} and its trend line g(k) =
αk + β, the linear detrending sequence of X, LD-sequence of X in short, is defined
as $\hat{X} = \{\hat{X}[1], \ldots, \hat{X}[n]\}$, where $\hat{X}[k] = X[k] - g(k)$,
k = 1, 2, ..., n. □
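
As an illustration of Definition 1, the following Python sketch fits the
least-squares trend line over the positions k = 1, ..., n and subtracts it from
the sequence. The function names trend_line and detrend are hypothetical names
chosen for this illustration; they are not taken from the implementation
evaluated in this paper.

```python
from typing import List, Tuple


def trend_line(x: List[float]) -> Tuple[float, float]:
    """Least-squares slope alpha and intercept beta of x over positions k = 1..n."""
    n = len(x)
    if n < 2:
        # Fewer than two points carry no direction; use a constant trend.
        return 0.0, (x[0] if n == 1 else 0.0)
    sum_k = n * (n + 1) / 2                     # 1 + 2 + ... + n
    sum_k2 = n * (n + 1) * (2 * n + 1) / 6      # 1^2 + 2^2 + ... + n^2
    sum_x = sum(x)
    sum_kx = sum(k * v for k, v in enumerate(x, start=1))
    alpha = (n * sum_kx - sum_k * sum_x) / (n * sum_k2 - sum_k ** 2)
    beta = (sum_x - alpha * sum_k) / n
    return alpha, beta


def detrend(x: List[float]) -> List[float]:
    """LD-sequence of x: subtract g(k) = alpha * k + beta from every entry."""
    alpha, beta = trend_line(x)
    return [v - (alpha * k + beta) for k, v in enumerate(x, start=1)]
```

For example, detrend([3, 5, 7, 9, 11]) yields values that are zero up to
rounding, because the sequence lies exactly on the line g(k) = 2k + 1.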

Linear detrending is easily supported in whole matching, but it is a challenging
problem in subsequence matching. In whole matching, the lengths of the data and
query sequences are all identical, and we can simply use the whole matching
solution [1] after linear detrending of all time-series. In contrast, the solution
is not simple in subsequence matching for the following reasons: (1) data
subsequences at different positions have different linear trends even though they
have the same length; and (2) data subsequences of different lengths also have
different linear trends even though they start from the same position. Therefore,
we need to consider a different linear trend for every possible query length and
for every possible position, and we cannot use the traditional whole/subsequence
matching solutions for linear detrending subsequence matching.

We now formally define the problem of linear detrending subsequence matching.
We first define the similarity of two sequences under linear detrending.



Definition 2. For two sequences X and Y of the same length and their
LD-sequences $\hat{X}$ and $\hat{Y}$, we define that X and Y (or $\hat{X}$ and
$\hat{Y}$) are LD-similar if the Euclidean distance between $\hat{X}$ and $\hat{Y}$
is less than or equal to the tolerance ε, i.e., if $D(\hat{X}, \hat{Y}) \le \epsilon$. □

Using the concept of LD-similarity, we now define the problem of linear detrend- 
ing subsequence matching as follows: 

Definition 3. For a data sequence S, a query sequence Q, and the tolerance ε,
linear detrending subsequence matching is the problem of finding all subsequences
S[i : j] which are LD-similar to Q, i.e., finding all subsequences S[i : j] such
that $D(\hat{Q}, \hat{S}[i:j]) \le \epsilon$. □



3.2 Sequential Scan-based Solution and Its Problems 

Sequential scan accesses every subsequence S[i : j] sequentially and investigates
its LD-similarity by computing $D(\hat{Q}, \hat{S}[i:j])$. Algorithm 1 shows the
sequential scan algorithm, LDSeqScan, which is simple and self-explanatory.
Algorithm LDSeqScan accesses all possible subsequences one by one in Lines 3 to 7
and returns the LD-similar subsequences by investigating their LD-similarity.

Algorithm 1 LDSeqScan(data sequence S, query sequence Q, tolerance ε)

1: Compute a trend line g(k) from Q using the least square method;
2: Obtain $\hat{Q}$ from Q and g(k) through linear detrending;
3: for each subsequence S[i : j] of length Len(Q) do
4:     Compute a trend line g'(k) from S[i : j] using the least square method;
5:     Obtain $\hat{S}[i:j]$ from S[i : j] and g'(k) through linear detrending;
6:     Return the subsequence S[i : j] if $D(\hat{Q}, \hat{S}[i:j]) \le \epsilon$; // LD-similar
7: end-for
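
The sequential scan can be sketched in Python as follows. This is a minimal
in-memory illustration under our own assumptions: the data sequence is a plain
list rather than a disk-resident database, offsets are 0-based, and the names
detrend, euclidean and ld_seq_scan are hypothetical.

```python
import math
from typing import List, Tuple


def detrend(x: List[float]) -> List[float]:
    """Subtract the least-squares trend line fitted over positions k = 1..n."""
    n = len(x)
    if n < 2:
        return [0.0] * n
    sum_k = n * (n + 1) / 2
    sum_k2 = n * (n + 1) * (2 * n + 1) / 6
    sum_x = sum(x)
    sum_kx = sum(k * v for k, v in enumerate(x, start=1))
    alpha = (n * sum_kx - sum_k * sum_x) / (n * sum_k2 - sum_k ** 2)
    beta = (sum_x - alpha * sum_k) / n
    return [v - (alpha * k + beta) for k, v in enumerate(x, start=1)]


def euclidean(x: List[float], y: List[float]) -> float:
    """Euclidean distance between two sequences of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def ld_seq_scan(s: List[float], q: List[float], eps: float) -> List[Tuple[int, int]]:
    """Return (i, j) for every subsequence s[i..j] that is LD-similar to q."""
    q_hat = detrend(q)                         # detrend the query once
    n = len(q)
    matches = []
    for i in range(len(s) - n + 1):            # every subsequence of length Len(Q)
        s_hat = detrend(s[i:i + n])            # detrend each subsequence separately
        if euclidean(q_hat, s_hat) <= eps:     # LD-similar
            matches.append((i, i + n - 1))     # 0-based, inclusive bounds
    return matches
```

Each candidate subsequence is detrended from scratch here, which makes the CPU
cost of the scan easy to see: the work per offset is proportional to Len(Q).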

The sequential scan algorithm has the advantage of simplicity, but has the
disadvantage of incurring severe CPU and I/O overhead. First, the algorithm causes
many disk accesses since it reads the entire data sequence from the database.
Second, the algorithm also causes severe CPU overhead since it investigates the
LD-similarity of every individual subsequence by performing linear detrending
and computing the Euclidean distance. This CPU and I/O overhead makes
LDSeqScan impractical for a large time-series database. To solve this problem,
we propose an efficient index-based solution in Section 3.3.

3.3 Index-based Solution and Its Algorithms 

As in traditional subsequence matching [5,7,13], we use the window construction
mechanism that divides data and query sequences into disjoint/sliding windows
of a fixed size. However, our solution differs considerably from the traditional
ones in how windows are constructed, because of linear detrending. In linear
detrending subsequence matching each window must be mapped to multiple windows,
whereas in traditional subsequence matching it is not. This is
because, in linear detrending subsequence matching, each window has multiple
trend lines, one for each length and position of the subsequences that include
the window. Formally speaking, for a given window S[a : b], there are many
different subsequences S[i : j] that include S[a : b]; their trend lines
differ from each other; and the window S[a : b] is therefore mapped to multiple
windows, one per trend line. We call this property the multiple mapping
property, which was already presented for normalization-transformed subsequence
matching [15]. The traditional subsequence matching solutions [5,7,13] do
not have the multiple mapping property, but we need to support it
in linear detrending subsequence matching.

To support the multiple mapping property, for a given window, we do not
remove the linear trend of the window itself; instead we remove the linear
trend of a subsequence that includes the window. To this end, we present the
notion of LD-windows as follows:

Definition 4. Suppose S[i : j] is a subsequence of a sequence S, g(k) is the
trend line (linear function) of S[i : j], and S[a : b] is a window included in
S[i : j]. Then the LD-window of S[a : b] against S[i : j], denoted by
$\hat{S}_{(i,j)}[a:b]$, is defined as a new window whose entry $\hat{S}_{(i,j)}[k]$
(k = a, a + 1, ..., b) is set to S[k] − g(k). □

Definition 4 means that a window S[a : b] is mapped to an LD-window
$\hat{S}_{(i,j)}[a:b]$ by the trend line of a subsequence S[i : j] which includes
S[a : b]. Because of the multiple mapping property, there are many subsequences
S[i : j] that include S[a : b], and thus each window S[a : b] is mapped to
multiple LD-windows $\hat{S}_{(i,j)}[a:b]$, one for each subsequence S[i : j].
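
To make the multiple mapping property concrete, the sketch below computes the
LD-window of a window against an enclosing subsequence. It assumes 0-based
indices with inclusive bounds; the names detrend and ld_window are hypothetical
and serve only this illustration.

```python
from typing import List


def detrend(x: List[float]) -> List[float]:
    """Subtract the least-squares trend line fitted over positions k = 1..n."""
    n = len(x)
    if n < 2:
        return [0.0] * n
    sum_k = n * (n + 1) / 2
    sum_k2 = n * (n + 1) * (2 * n + 1) / 6
    sum_x = sum(x)
    sum_kx = sum(k * v for k, v in enumerate(x, start=1))
    alpha = (n * sum_kx - sum_k * sum_x) / (n * sum_k2 - sum_k ** 2)
    beta = (sum_x - alpha * sum_k) / n
    return [v - (alpha * k + beta) for k, v in enumerate(x, start=1)]


def ld_window(s: List[float], i: int, j: int, a: int, b: int) -> List[float]:
    """LD-window of s[a..b] against the enclosing subsequence s[i..j]."""
    if not (0 <= i <= a <= b <= j < len(s)):
        raise ValueError("the window must lie inside the subsequence")
    residual = detrend(s[i:j + 1])       # remove the trend of the subsequence
    return residual[a - i:b - i + 1]     # keep only the entries of the window
```

With s = [0, 1, 4, 9, 16, 25, 36, 49], the window s[2..5] maps to
[1, -1, -1, 1] against itself but to [-3, -5, -5, -3] against the whole
sequence s[0..7], so one window indeed yields different LD-windows for
different enclosing subsequences.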

Like the traditional subsequence matching algorithms [5,7,13], our index-based
solution transforms high-dimensional windows into a low-dimensional space and
stores the result in a multidimensional index. Unlike the traditional
algorithms, however, our solution maps each high-dimensional window
to a low-dimensional MBR that bounds multiple low-dimensional points. This
is due to the multiple mapping property: a window is mapped to multiple
LD-windows. An MBR is constructed from a window as follows: (1)
the given window is mapped to multiple LD-windows; (2) the LD-windows are
transformed to low-dimensional points by the lower-dimensional transformation; and
(3) a low-dimensional MBR is constructed by bounding the transformed points.
We call this MBR an LD-MBR and formally define it as follows:

Definition 5. Suppose s is a window of a sequence S, $\mathbb{S}$ is
$\{\hat{s} \mid \hat{s}$ is an LD-window of s$\}$, and T(·) is a function of
lower-dimensional transformation. Then the LD-MBR of s, denoted by
$M(T(\mathbb{S}))$, is defined as a low-dimensional MBR that bounds all
low-dimensional points $T(\hat{s})$ for all $\hat{s} \in \mathbb{S}$. □

Figure 2 shows the process of constructing an LD-MBR of a window S[a : b].
The process is as follows: (1) the window S[a : b] is mapped to multiple
LD-windows $\hat{S}_{(i_k,j_k)}[a:b]$ by considering the possible subsequences
S[i_k : j_k]; (2) each LD-window is transformed to a low-dimensional point; and
(3) an LD-MBR is constructed by bounding those points.
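
The three steps above can be sketched as follows. This is a naive illustration
under stated assumptions, not the implementation evaluated in this paper: it
uses PAA scaled by the square root of the segment length so that distances in
the feature space lower-bound Euclidean distances, it assumes the window length
is a multiple of the number of features f, and it enumerates every enclosing
subsequence without any incremental computation. The names detrend, paa and
ld_mbr are hypothetical.

```python
import math
from typing import List, Tuple


def detrend(x: List[float]) -> List[float]:
    """Subtract the least-squares trend line fitted over positions k = 1..n."""
    n = len(x)
    if n < 2:
        return [0.0] * n
    sum_k = n * (n + 1) / 2
    sum_k2 = n * (n + 1) * (2 * n + 1) / 6
    sum_x = sum(x)
    sum_kx = sum(k * v for k, v in enumerate(x, start=1))
    alpha = (n * sum_kx - sum_k * sum_x) / (n * sum_k2 - sum_k ** 2)
    beta = (sum_x - alpha * sum_k) / n
    return [v - (alpha * k + beta) for k, v in enumerate(x, start=1)]


def paa(w: List[float], f: int) -> List[float]:
    """Scaled PAA: segment means multiplied by sqrt(segment length)."""
    seg = len(w) // f                    # assumes len(w) is a multiple of f
    scale = math.sqrt(seg)
    return [scale * sum(w[t * seg:(t + 1) * seg]) / seg for t in range(f)]


def ld_mbr(s: List[float], a: int, b: int, min_len: int, max_len: int,
           f: int) -> Tuple[List[float], List[float]]:
    """Low and high corners of the LD-MBR of the window s[a..b] (0-based, inclusive)."""
    low = [math.inf] * f
    high = [-math.inf] * f
    for length in range(min_len, max_len + 1):
        first = max(0, b - length + 1)       # leftmost start that still covers b
        last = min(a, len(s) - length)       # rightmost start that still covers a
        for i in range(first, last + 1):
            residual = detrend(s[i:i + length])           # step (1): LD-window
            point = paa(residual[a - i:b - i + 1], f)     # step (2): f-dim point
            low = [min(l, p) for l, p in zip(low, point)]     # step (3): grow MBR
            high = [max(h, p) for h, p in zip(high, point)]
    return low, high
```

The nested loops visit every query length and every position, which shows why
the index is built once and then reused for all queries.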

Our index-based solution is developed from the following Theorem 1.

Theorem 1. For a query sequence Q, a data subsequence S[i : j], a tolerance ε,
and a function T(·) of lower-dimensional transformation, if Q and S[i : j] are
LD-similar, that is, if $D(\hat{Q}, \hat{S}[i:j]) \le \epsilon$, then for at least
one k the distance between $T(\hat{q}_k)$ and $M(T(\mathbb{S}_k))$ is at most
$\epsilon/\sqrt{p}$, where $\hat{q}_1, \ldots, \hat{q}_p$ and $s_1, \ldots, s_p$
are the p disjoint windows of $\hat{Q}$ and S[i : j], respectively, and
$\mathbb{S}_k$ is the set of LD-windows of $s_k$. That is, the following Eq. (2)
holds:

$$D(\hat{Q}, \hat{S}[i:j]) \le \epsilon \;\Longrightarrow\; \bigvee_{k=1}^{p} D\bigl(T(\hat{q}_k), M(T(\mathbb{S}_k))\bigr) \le \epsilon/\sqrt{p} \qquad (2)$$

Proof: The proof is similar to that of normalization-transformed subsequence
matching in the previous work [15]. Refer to the proof of Theorem 1 in [15] for
the detailed proof. □
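
The reasoning behind Eq. (2) can be sketched as the following chain of
inequalities. This is our reading of the standard argument, not a replacement
for the proof in [15], and it rests on three assumptions: the transformation T
is lower-bounding, i.e., D(T(x), T(y)) ≤ D(x, y); the p windows are disjoint;
and the LD-MBR of each window was built over a set of subsequences that
contains S[i : j]. Here $\mathbb{S}_k$ denotes the set of LD-windows of the k-th
data window $s_k$, and $\hat{s}_k$ denotes its LD-window against S[i : j].

```latex
% \hat{s}_k is the LD-window of s_k against S[i:j], hence \hat{s}_k \in \mathbb{S}_k.
\begin{align*}
\epsilon^2 \;\ge\; D(\hat{Q}, \hat{S}[i:j])^2
  &\;\ge\; \sum_{k=1}^{p} D(\hat{q}_k, \hat{s}_k)^2
     && \text{(disjoint windows cover disjoint entries)} \\
  &\;\ge\; \sum_{k=1}^{p} D\bigl(T(\hat{q}_k), T(\hat{s}_k)\bigr)^2
     && \text{($T$ is lower-bounding)} \\
  &\;\ge\; \sum_{k=1}^{p} D\bigl(T(\hat{q}_k), M(T(\mathbb{S}_k))\bigr)^2
     && \text{($T(\hat{s}_k)$ lies inside the LD-MBR)} \\
  &\;\ge\; p \cdot \min_{1 \le k \le p} D\bigl(T(\hat{q}_k), M(T(\mathbb{S}_k))\bigr)^2 .
\end{align*}
% Hence at least one window k satisfies
% D(T(\hat{q}_k), M(T(\mathbb{S}_k))) \le \epsilon / \sqrt{p}.
```

Taking the contrapositive, a data window whose LD-MBR is farther than
$\epsilon/\sqrt{p}$ from every query window point cannot belong to an LD-similar
subsequence, so discarding it at the index level causes no false dismissal.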

Theorem 1 guarantees the correctness of our index-based solution to linear
detrending subsequence matching. Like the traditional subsequence matching
solutions, our solution also consists of two algorithms: (1) the index-building
algorithm and (2) the subsequence matching algorithm.

Fig. 2. Process of constructing an LD-MBR of a window S[a : b] (figure labels:
multidimensional index; construction of an LD-MBR).

Algorithm 2 shows the index-building algorithm. In Line 2 we divide the
given data sequence into sliding or disjoint windows of size ω. For the first
subsequence matching solution of [5], we use sliding windows; in contrast, for
the more recent Dual Match [13], we use disjoint windows. In Lines 4 to 14, we
build a multidimensional index by repeating the following three steps for each
window S[a : b]: (1) compute the trend lines of all possible subsequences
(Line 8); (2) obtain LD-windows using those trend lines (Line 9); and (3) map
those LD-windows to an LD-MBR (Line 10). After obtaining an LD-MBR from a window,
we store it in the index together with its starting offset a (Line 13). Once the
index has been built by Algorithm BuildIndex, it is used repeatedly by the
subsequence matching algorithm.

Algorithm 3 shows the subsequence matching algorithm. In Line 2 we first
eliminate the linear trend from the query sequence Q. In Line 3 we divide the
LD-sequence $\hat{Q}$ into disjoint or sliding windows q of size ω. For the first
solution of [5], we use disjoint windows; in contrast, for Dual Match [13], we use
sliding windows. In Lines 5 to 11, we find candidate subsequences by repeating
the following steps for each query window q: (1) transform the high-dimensional
window q to a low-dimensional point (Line 6); (2) make a range query using that
point and the given tolerance (Line 7); and (3) find candidate subsequences by
evaluating the range query on the index (Lines 9 and 10). After obtaining the
candidate subsequences, we finally perform the post-processing step [1,5,7,13]
(Line 12) to identify the true LD-similar subsequences by accessing the actual
subsequences and eliminating false alarms.
Algorithm 2 BuildIndex(data sequence S)

1: Let the window size be ω and the minimum/maximum query lengths be l_min, l_max;
2: Divide S into windows of size ω;
3: // sliding windows for [5]; disjoint windows for Dual Match [13].
4: for each window S[a : b] in S do
5:     Make an f-dimensional MBR M which is initially empty;
6:     for each query length l ∈ [l_min, l_max] do
7:         for each subsequence S[i : j] of length l that includes S[a : b] do
8:             Compute a trend line of S[i : j] based on the least square method;
9:             Obtain the LD-window $\hat{S}_{(i,j)}[a:b]$; // linear detrending
10:            Transform $\hat{S}_{(i,j)}[a:b]$ to an f-dimensional point and include it into M;
11:        end-for
12:    end-for
13:    Make a record <M, offset = a> for S[a : b], and store it into the index;
14: end-for



Algorithm 3 SubsequenceMatching(query sequence Q, tolerance ε)

1: Let the window size be ω; // ω is the same one used in Algorithm 2
2: Obtain $\hat{Q}$ from Q by eliminating the linear trend;
3: Divide $\hat{Q}$ into windows of size ω;
4: // disjoint windows for [5]; sliding windows for Dual Match [13].
5: for each window q do
6:     Transform q to an f-dimensional point; // lower-dimensional transformation
7:     Construct a range query using that point and ε/√p;
8:     // p is the number of included windows in Q [13].
9:     Evaluate the query on the index and find the records of the form <M, a>;
10:    Include in the candidate set the subsequences S[i : j] obtained from <M, a>;
11: end-for
12: Perform the post-processing step [1,5,7,13] to eliminate false alarms;
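
The range query of Lines 7 to 10 can be sketched as follows. In this
illustration a linear scan over a list of records stands in for the R*-tree
search, each record is assumed to be a tuple (low corner, high corner, offset a)
of an LD-MBR, and the post-processing step is not shown. The names mindist and
find_candidates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (low corner of the LD-MBR, high corner of the LD-MBR, window offset a)
Record = Tuple[List[float], List[float], int]


def mindist(point: Sequence[float], low: Sequence[float],
            high: Sequence[float]) -> float:
    """Smallest Euclidean distance from a point to an axis-aligned rectangle."""
    total = 0.0
    for p, lo, hi in zip(point, low, high):
        if p < lo:
            total += (lo - p) ** 2
        elif p > hi:
            total += (p - hi) ** 2
        # a coordinate inside [lo, hi] contributes nothing
    return math.sqrt(total)


def find_candidates(query_points: List[List[float]], records: List[Record],
                    eps: float) -> List[Tuple[int, int]]:
    """Return (query window number k, data window offset a) for every index hit."""
    if not query_points:
        return []
    radius = eps / math.sqrt(len(query_points))    # eps / sqrt(p)
    hits = []
    for k, point in enumerate(query_points):
        for low, high, offset in records:          # stand-in for the index search
            if mindist(point, low, high) <= radius:
                hits.append((k, offset))
    return hits
```

Each hit identifies a data window whose LD-MBR lies within ε/√p of a query
window point; the candidate subsequence aligned with that pair of windows is
then verified against the actual data in the post-processing step.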



4 Experimental Evaluation 
4.1 Experimental Setup 

We have performed experiments using three real data sets, which were also used in
the previous work [9]. Each data set consists of one long data sequence, which has
the same effect as a data set consisting of multiple data sequences [5,13]. The
first data set contains electrocardiogram (ECG) data, and we call it ECG-DATA.
The second data set shows tax growth rates, and we call it TAX-DATA. The third
data set contains exchange rate data, and we call it EXCH-DATA. The length of
each data set is 100,000, that is, each data set consists of 100,000 entries
(time points).

In the experiments we have compared two matching solutions: (1) LDSeqScan,
the sequential scan solution presented in Section 3.2; and (2) the index-based
matching solution proposed in Section 3.3. We have adopted the first subsequence
matching solution of [5] and implemented our index-based approach on top of that
subsequence matching solution. For simplicity, we call this index-based solution
LDIndexMatch. We have performed two different experiments. In the first
experiment we set the window size and the selectivity¹ [11] to 256 and I0~'^,
respectively, and vary the query sequence length from 256 to 1024. In the second
experiment we set the window size and the query sequence length to 256 and
512, respectively, and vary the selectivity from 10~^ to 10""*. We obtain the
desired selectivity by controlling the tolerance ε [13]. As the metric of
efficiency, we measure the elapsed time of each solution. We generate query
sequences from the data sequence by taking subsequences of length Len(Q) starting
from random offsets [5,13,15]. To avoid the effects of noise, we experiment with
20 different query sequences of the same length and use their average as the
result.

The hardware platform was a SUN Ultra 25 workstation equipped with an UltraSPARC
IIIi 1.34GHz CPU, 1.0GB RAM, and an 80GB hard disk; its software
platform was Solaris 10. We used the C/C++ language to implement the two matching
solutions. In LDIndexMatch, we used PAA [7,9] as the lower-dimensional
transformation and extracted eight features from a window of size 256. As
the multidimensional index, we used the R*-tree [1,5,13] for LDIndexMatch.



4.2 Experimental Results 

Figure 3 shows the results of the first experiment, which uses different lengths
of query sequences. We first note that, in Figure 3(a) for ECG-DATA, LDIndexMatch
significantly outperforms LDSeqScan. This means that the notion of LD-windows
works properly, and that it prunes many unnecessary accesses to subsequences at
the index level. As shown in Figure 3(a), as the query sequence length decreases,
the performance difference between the two solutions becomes larger. For example,
compared with LDSeqScan, LDIndexMatch reduces the elapsed time by 38.0 times
for the query sequence of length 256; in contrast, it reduces the elapsed time
by only 1.60 times for the query sequence of length 1024. This is explained by the
window size effect [15]: the performance of index-based solutions decreases
as the query sequence length increases relative to the given window size. We can solve
this problem by using the index interpolation technique [11], which uses multiple
indexes (for multiple window sizes) to obtain better performance. Figures
3(b) and 3(c) for TAX-DATA and EXCH-DATA show a trend very similar to that of
Figure 3(a) for ECG-DATA. This means that the proposed LDIndexMatch exploits
the pruning effect efficiently, regardless of data type. In summary, Figure 3
shows that our index-based solution, LDIndexMatch, improves the overall
performance by 1.57 to 38.0 times compared with the straightforward solution,
LDSeqScan.

Figure 4 shows the results of the second experiment, which uses different
selectivities (i.e., different tolerances). As in Figure 3, LDIndexMatch
outperforms LDSeqScan in all selectivity ranges of Figure 4. We note that, in
Figure 4(a) for ECG-DATA, the performance difference between LDIndexMatch and
LDSeqScan increases as the selectivity decreases. This is because, as the
selectivity decreases, the number of candidate subsequences also decreases. More
precisely, as shown in Lines 7 to 10 of Algorithm 3, a smaller selectivity yields
a smaller number of candidate subsequences at the index level, and this reduces
the false alarms that cause disk accesses and expensive computations on the actual
subsequences. As in Figure 3, Figures 4(b) and 4(c) for TAX-DATA and EXCH-DATA
show a trend very similar to that of Figure 4(a) for ECG-DATA. Figure 4
demonstrates that LDIndexMatch beats LDSeqScan in all selectivity ranges, which
means that its advantage does not depend much on the selectivity values.

¹ Selectivity = (the number of subsequences that are LD-similar to the query
sequence) / (the number of all possible subsequences in the database)

Fig. 3. Experimental results by varying the query sequence length.

5 Conclusions 

In this paper we introduced a new problem of linear detrending subsequence
matching and proposed an efficient index-based solution. The contributions of the
paper are summarized as follows. First, we formally defined linear detrending
subsequence matching and presented its sequential scan-based solution. Second,
we presented a novel notion of LD-windows, and using LD-windows we proposed
an index-based solution, whose correctness we formally proved.
Third, we described the index-building and subsequence matching algorithms
of the index-based solution. Fourth, we showed that, compared with
the straightforward sequential scan, our index-based solution improves the
matching performance by one or two orders of magnitude. We believe
that linear detrending subsequence matching and its index-based solution
will be very helpful in finding meaningful time-series patterns hidden by the
linear trend.

Fig. 4. Experimental results by varying the selectivity.